ABSTRACT


Embedding-based Retrieval Models (ERMs) have emerged as a 
promising framework for large-scale text retrieval problems due to 
powerful large language models. Nevertheless, fine-tuning ERMs to 
reach state-of-the-art results can be expensive due to the extreme 
scale of data as well as the complexity of multi-stage pipelines (e.g.,
pre-training, fine-tuning, distillation). In this work, we propose the
PEFA framework, namely ParamEter-Free Adapters, for fast tuning
of ERMs without any backward pass in the optimization. At the
index building stage, PEFA equips the ERM with a non-parametric
k-nearest neighbor (kNN) component. At the inference stage, PEFA
performs a convex combination of two scoring functions, one from
the ERM and the other from the kNN. Based on the neighborhood
definition, the PEFA framework induces two realizations, namely
PEFA-XL (i.e., extra large) using double ANN indices and PEFA-XS
(i.e., extra small) using a single ANN index. Empirically, PEFA achieves
significant improvement on two retrieval applications. For document
retrieval, regarding the Recall@100 metric, PEFA improves not
only pre-trained ERMs on Trivia-QA by an average of 13.2%, but
also fine-tuned ERMs on NQ-320K by an average of 5.5%.
For product search, PEFA improves the Recall@100 of the
fine-tuned ERMs by an average of 5.3% and 14.5%, for PEFA-XS and
PEFA-XL, respectively. Our code is available at
https://github.com/amzn/pecos/tree/mainline/examples/pefa-wsdm24
1 INTRODUCTION


Given a user’s query, large-scale text retrieval aims to recall a
match set of semantically relevant documents in real time from
an enormous corpus, whose size can be 100 million or more.
Embedding-based retrieval models (ERMs) [6, 29, 58], namely bi- 
encoders [20, 43], have emerged as the prevalent paradigm for large- 
scale text retrieval, thanks to recent advances in large language 
models (LLMs). At the learning stage, ERMs fine-tune parametric
Transformer encoders that map queries and documents into
a semantic embedding space where relevant (query, document)
pairs are close to each other and irrelevant pairs are far apart.
At the inference stage,
retrieving relevant documents from the enormous output space
can be formulated as the maximum inner product search (MIPS)
problem [60]. With proper indexing data structures, the MIPS problem
can be efficiently solved by approximate nearest neighbor (ANN)
search libraries (e.g., Faiss [25], ScaNN [15], HNSWLIB [39]) in time
sub-linear in the size of the corpus.

Adapting ERMs to downstream retrieval tasks usually follows 
the full-parameter fine-tuning paradigm, which requires gradient 
computations and updates parameters of Transformer encoders. 
Such a full-parameter fine-tuning approach faces challenges in the
industrial setup, where learning signals are enormous. In modern
e-commerce stores, for example, the number of relevant (query,
product) pairs can be billions or more. Full-parameter fine-tuning of
ERMs at such a scale may take thousands of GPU hours due to the
complicated multi-stage pipeline: pre-training [6, 12, 13], 1st stage
fine-tuning with random negatives and BM25 candidates [29], 2nd 
stage fine-tuning with hard-mined negatives [58, 62], and 3rd stage 
fine-tuning with distilled knowledge from expensive cross-attention 
models [48, 63]. Furthermore, these fine-tuning approaches require 
access to models’ gradient information, which is not accessible for 
many black box LLMs such as GPT-3 [4] and beyond. 

In this work, we propose the PEFA framework (i.e., ParamEter-Free
Adapters) for fast tuning of black-box ERMs, which does not
require any gradient information of the model. The scoring function
of PEFA is a convex combination between the ERM and a new
non-parametric k-nearest neighbor (kNN) model. The learning of the kNN
model reduces to constructing an ANN index that stores key-value
pairs of query embeddings and learning signals. Given a query at
inference time, the kNN model seeks close-by training queries in the
neighborhood, and aggregates the associated relevant documents as
its scoring function. Depending on the definition of neighborhood,
we introduce two kNN models under our PEFA framework:


• PEFA-XS: the neighborhood is defined by the relevant
query-document pairs, which is independent of the test-time query.

• PEFA-XL: the neighborhood is an intersection of the one
in PEFA-XS and kNN queries in the training set, which
depends on the test-time query.


In summary, we highlight four key contributions below.


• We propose PEFA, a novel Parameter-free Adapters framework
for fast tuning of ERMs to downstream retrieval tasks.

• PEFA requires no gradient information of ERMs, and is hence
applicable to black-box ERMs.

• PEFA is not only applicable to a wide range of pre-trained
ERMs, but also effective for fine-tuned ERMs.

• We demonstrate the effectiveness and scalability of PEFA
on two retrieval applications, including document retrieval
tasks and industrial-scale product search tasks.


For document retrieval, PEFA not only improves the Recall@100 of
pre-trained ERMs on Trivia-QA by an average of 13.2%, but also
lifts the Recall@100 of fine-tuned ERMs on NQ-320K by an average
of 5.5%. On the NQ-320K dataset, applying PEFA to the fine-tuned
GTRbase [42] reaches new state-of-the-art (SoTA) results, where the
Recall@10 of 88.71% outperforms the Recall@10 of 85.20% of the
previous SoTA Seq2Seq-based NCI [56], under similar model size
for fair comparison. For product search on billion-scale
data, PEFA improves the Recall@100 of the fine-tuned ERMs by an
average of 5.3% and 14.5%, for PEFA-XS and PEFA-XL, respectively.


2 PRELIMINARY 


2.1 Dense Text Retrieval 


Dense text retrieval typically adopts the Embedding-based Retrieval 
Model (ERM) architecture, also known as bi-encoders [6, 29, 58]. 
For simplicity, we use the terms passage and document interchangeably
in the rest of the paper. Given a query $q \in \mathcal{X}$ and a passage $p \in \mathcal{X}$, the
relevance scoring function $f_{\text{ERM}}(q, p)$ of the ERM is measured by


$$f_{\text{ERM}}(q, p; \theta) = \langle E(q; \theta), E(p; \theta) \rangle, \tag{1}$$


where $E(\cdot; \theta): \mathcal{X} \to \mathbb{R}^{d}$ is the encoder parameterized with $\theta$
that maps an input text to a $d$-dimensional vector space and
$\langle \cdot, \cdot \rangle: \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ is the similarity function, such as the inner product or
cosine similarity. Without loss of generality, we use the inner product
as the scoring function for the rest of the paper.


Learning. Suppose the training data is presented as a set of
relevant query-passage pairs $\mathcal{D} = \{(q_n, p_n)\}$. The encoder
parameters $\theta$ are often learned by maximizing the log-likelihood
[6, 29], $\max_{\theta} \sum_{(q,p) \in \mathcal{D}} \log p_{\theta}(p \mid q)$, where the conditional
probability is defined by the Softmax function


$$p_{\theta}(p \mid q) = \frac{\exp\big(f_{\text{ERM}}(q, p; \theta)\big)}{\sum_{p' \in \mathcal{P}} \exp\big(f_{\text{ERM}}(q, p'; \theta)\big)}.$$


In practice, various negative sampling techniques [11, 29, 34, 58]
have been developed to approximate the expensive partition function
of the conditional Softmax. We direct interested readers to the
comprehensive study [14] for more details of learning ERMs.
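As a small illustration of the objective above, the following sketch evaluates the Softmax log-probability of one relevant passage over a toy corpus. It is a minimal sketch under stated assumptions: the hand-written vectors stand in for encoder outputs, the partition function is computed exactly rather than with negative sampling, and the function names are hypothetical.

```python
import math

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_softmax_prob(query_emb, passage_embs, positive_idx):
    """log p(p | q), with the partition function summed over all passages."""
    scores = [inner(query_emb, p) for p in passage_embs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return scores[positive_idx] - log_z

# Toy data (assumed): one query and three passages in a 2-d embedding space.
q = [1.0, 0.0]
P = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
log_p = log_softmax_prob(q, P, positive_idx=0)
```

The exact sum over all passages is what becomes expensive at scale, which is why negative sampling is used in practice.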


Inference. Given a query embedding $\mathbf{q} \in \mathbb{R}^{d}$ and a corpus of
$n$ passage embeddings $\mathcal{P} = \{\mathbf{p}_j\}_{j=1}^{n}$ where $\mathbf{p}_j \in \mathbb{R}^{d}$, $j = 1, \ldots, n$,
ERMs retrieve the $k$ most relevant passages from $\mathcal{P}$ in real time, which
is a Maximum Inner Product Search (MIPS) problem. Exact inference
of the MIPS problem requires $O(n)$ time, which is prohibitive for
large-scale retrieval applications. Thus, practitioners leverage
Approximate Nearest Neighbor (ANN) search to approximately solve
it in time sub-linear (e.g., $O(\log(n))$) in the size of the corpus $n$.
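To make the O(n) cost of exact inference concrete, here is a minimal sketch of brute-force MIPS in plain Python. The function name and toy vectors are assumptions for illustration; a real system would replace the linear scan with an ANN index.

```python
import heapq

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def exact_mips(query_emb, passage_embs, k):
    """Indices of the k passages with the largest inner product (linear scan)."""
    scored = ((inner(query_emb, p), j) for j, p in enumerate(passage_embs))
    return [j for _, j in heapq.nlargest(k, scored)]

# Toy corpus (assumed): four unit-norm passage embeddings in 2-d.
query = [1.0, 0.0]
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
top2 = exact_mips(query, corpus, k=2)
```

Every query touches all n passages here, which is the cost that sub-linear ANN search avoids.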

To achieve sub-linear time complexity of ANN search, ANN 
methods require an additional index building stage to preprocess the 
corpus P into specific data structures, such as hierarchical graphs 
(e.g., HNSW [39], VAMANA [23], etc) and product quantization (e.g., 
FAISS [25], ScaNN [15], etc). Compared to the cost of full-parameter
fine-tuning of ERMs on GPU machines, the cost of building an ANN index
is often negligible, as the latter takes place on a lower-cost single
CPU machine with much faster computational time.


2.2 Problem Statement 


Notations. $\mathbf{Y} \in \{0, 1\}^{m \times n}$ is the query-to-passage relevance
matrix, namely the supervised training data. The row indices of $\mathbf{Y}$ refer
to the set of queries $\mathcal{Q}$, and the column indices of $\mathbf{Y}$ refer to the set
of passages $\mathcal{P}$, respectively. Note that $\mathcal{P}$ is also the corpus space
used for ANN inference of ERMs. The bold matrices $\mathbf{P} \in \mathbb{R}^{n \times d}$ and
$\mathbf{Q} \in \mathbb{R}^{m \times d}$ denote the passage and query embeddings of $\mathcal{P}$ and
$\mathcal{Q}$ obtained from the ERM, respectively. We denote $\mathbf{Y}_{i,:} \in \{0, 1\}^{n}$
as the relevance vector of the $i$th query $q_i$, representing which set of
passages in $\mathcal{P}$ are relevant to this query $q_i$. Similarly, $\mathbf{Y}_{:,j} \in \{0, 1\}^{m}$
is the relevance vector of the $j$th passage $p_j$, representing which set of
queries in $\mathcal{Q}$ are relevant to this passage $p_j$. Finally, $\mathbf{q} \in \mathbb{R}^{d}$ is the
query embedding at inference time and $\text{NN}(\mathbf{q}, \mathbf{P}; k)$ is the set of $k$
nearest indices in the indexed database $\mathbf{P}$ given $\mathbf{q}$.
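The notation can be made concrete with a toy relevance matrix. This is only an illustrative sketch: the matrix values and the helper names `row` and `col` are assumptions, and a dense list of lists stands in for what would be a sparse matrix at scale.

```python
# Toy relevance matrix Y with m = 3 queries (rows) and n = 4 passages (columns).
Y = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
m, n = len(Y), len(Y[0])

def row(Y, i):
    """Y_{i,:}: which passages are relevant to query i."""
    return list(Y[i])

def col(Y, j):
    """Y_{:,j}: which queries are relevant to passage j."""
    return [Y[i][j] for i in range(len(Y))]

# Number of relevant (query, passage) pairs, written nnz(Y) later in the text.
nnz = sum(sum(r) for r in Y)
```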


Problem Setup. In this work, we propose PEFA, parameter-free
adapters for ERMs that equip them with a non-parametric kNN
component. The non-parametric kNN model is learning-free, which
avoids any optimization step to fine-tune parameters of ERMs. The
major computation of PEFA becomes building an ANN index storing
key-value pairs for serving efficiently at the inference stage. Thus,
our PEFA is applicable to both pre-trained and fine-tuned
ERMs, even ones initialized from black-box LLMs. Note that PEFA
is orthogonal and complementary to most existing literature that aims
to obtain better pre-trained or fine-tuned ERMs at the learning stage,
including recent studies of the parameter-efficient fine-tuning of
ERMs [28, 37, 44]. Finally, for ease of discussion, we assume
embeddings obtained from ERMs are unit-norm (i.e., $\ell_2$-normalized),
hence the inner product is equivalent to the cosine similarity. The
techniques proposed in this paper can be easily extended to
non-unit-norm cases by replacing the distance metric used in kNN.
Figure 1: Illustration of the proposed PEFA-XL method. The
green and blue dashed lines represent computation paths for
building ANN indices offline. The red solid line represents
computation paths for performing ANN search online. PEFA-XL
requires two ANN indices: one on the passage space P for
fast inference of the ERM, and the other on the query space Q
for fast inference of the kNN model.


3 PROPOSED FRAMEWORK 


In this section, we propose PEFA, a Parameter-free Adapters framework
for fast tuning of ERMs. Given a query embedding $\mathbf{q}$ at
inference time, our PEFA framework defines the relevance scoring
function of a query-passage pair $(\mathbf{q}, \mathbf{p}_j)$ as the convex combination
of the scoring functions of the black-box ERM and a
non-parametric kNN model:


$$f_{\text{PEFA}}(\mathbf{q}, \mathbf{p}_j) = \lambda \cdot f_{\text{ERM}}(\mathbf{q}, \mathbf{p}_j) + (1 - \lambda) \cdot f_{\text{kNN}}(\mathbf{q}, \mathbf{p}_j), \tag{2}$$


where $\lambda \in [0, 1]$ is the interpolation hyper-parameter that balances
the importance of the ERM and the kNN model. Note that the
proposed PEFA framework is learning-free. In other words, the
underlying parameters $\theta$ of the ERM remain unchanged, and Equation 2
is only applied at inference time.

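Next, we present the scoring function of kNN models in a generic
form as follows:


$$f_{\text{kNN}}(\mathbf{q}, \mathbf{p}_j) = \langle \mathbf{q}, \mathbf{Q}^{\top} \mathbf{D}(\mathbf{q}, \mathbf{Q}) \mathbf{Y}_{:,j} \rangle, \tag{3}$$


where $\mathbf{D}(\mathbf{q}, \mathbf{Q}) \in \mathbb{R}^{m \times m}$ is a normalized diagonal matrix, acting like
a gating mechanism that controls which set of training queries the
current test query $\mathbf{q}$ should pay attention to.

Plugging Equation 3 back into Equation 2, we derive the scoring
function of PEFA explicitly:


$$f_{\text{PEFA}}(\mathbf{q}, \mathbf{p}_j) = \lambda \langle \mathbf{q}, \mathbf{p}_j \rangle + (1 - \lambda) \langle \mathbf{q}, \mathbf{Q}^{\top} \mathbf{D}(\mathbf{q}, \mathbf{Q}) \mathbf{Y}_{:,j} \rangle. \tag{4}$$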
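The generic scoring function can be sketched directly from Equations 3 and 4. In this minimal sketch the diagonal of D is passed as a plain list `gate` (a hypothetical name), embeddings are toy lists, and everything is computed by brute force; it illustrates the formula rather than an efficient implementation.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def knn_score(q, Q, gate, Y, j):
    """Equation 3: <q, Q^T D Y_{:,j}> with D = diag(gate)."""
    return sum(gate[i] * Y[i][j] * inner(q, Q[i]) for i in range(len(Q)))

def pefa_score(q, p_j, Q, gate, Y, j, lam):
    """Equation 4: convex combination of the ERM score and the kNN score."""
    return lam * inner(q, p_j) + (1.0 - lam) * knn_score(q, Q, gate, Y, j)

# Toy setup (assumed): two training queries, one passage, unit-norm embeddings.
q = [1.0, 0.0]
p0 = [0.0, 1.0]
Q = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1], [0]]          # only training query 0 is relevant to passage 0
gate = [1.0, 1.0]       # attend to every training query
score = pefa_score(q, p0, Q, gate, Y, j=0, lam=0.5)
```

Setting `lam=1.0` recovers the ERM score and `lam=0.0` recovers the kNN score, matching the two end points of the convex combination.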


Based on the design of the diagonal matrix $\mathbf{D}(\mathbf{q}, \mathbf{Q})$, we present two
realizations of kNN models under the PEFA framework, namely
PEFA-XL (Section 3.1) and PEFA-XS (Section 3.2). We then discuss
their intuitions, time and space complexity, and connections to the
related literature. For the rest of the paper, we use HNSW [39] as the
underlying ANN method in our PEFA framework for complexity
analysis and experiment results.
Figure 2: Illustration of the proposed PEFA-XS method. The
green and blue dashed lines represent computation paths for
building the ANN index offline. The red solid line represents
computation paths for performing ANN search online. PEFA-XS
requires only one ANN index. As the neighborhood of kNN
becomes independent of q, the interpolation of the two scoring functions
can be pre-computed in the embedding space when building
the ANN index offline.


3.1 PEFA-XL 


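A standard realization of the kNN model is that the test query $\mathbf{q}$
only pays attention to the top-$k$ most similar training queries in $\mathbf{Q}$.
Specifically, $\mathbf{D}_{i,i} = 1$ if $i \in \text{NN}(\mathbf{q}, \mathbf{Q}; k)$; otherwise $\mathbf{D}_{i,i} = 0$. We can
then derive the kNN model of Equation 3 as follows:


$$f_{\text{kNN}}(\mathbf{q}, \mathbf{p}_j) = \Big\langle \mathbf{q}, \sum_{i=1}^{m} \mathbf{q}_i \mathbf{D}_{i,i} \mathbf{Y}_{i,j} \Big\rangle = \sum_{i \in \text{NN}(\mathbf{q}, \mathbf{Q}; k)} \langle \mathbf{q}, \mathbf{q}_i \rangle \cdot \mathbf{Y}_{i,j}. \tag{5}$$


By plugging the query-aware kNN model of Equation 5 into Equation 2,
we present the scoring function of PEFA-XL explicitly:


$$f_{\text{PEFA-XL}}(\mathbf{q}, \mathbf{p}_j) = \lambda \langle \mathbf{q}, \mathbf{p}_j \rangle + (1 - \lambda) \sum_{i \in \text{NN}(\mathbf{q}, \mathbf{Q}; k)} \langle \mathbf{q}, \mathbf{q}_i \rangle \cdot \mathbf{Y}_{i,j}. \tag{6}$$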


Implementation. An illustration of the PEFA-XL method is presented
in Figure 1. Intuitively, the kNN model of PEFA-XL produces
its match set by aggregating the relevant passages $\mathbf{Y}_{i,j}$ of training
queries $\mathbf{q}_i$ that are in the neighborhood of the test query $\mathbf{q}$. Note
that the scoring function $f_{\text{ERM}}$ in Equation 6 is bounded
within $[-1, 1]$, as the inner product of two unit-norm embeddings
is bounded by the range of cosine similarity. On the other hand, the
scoring function $f_{\text{kNN}}$ in Equation 6 needs an additional normalization
so that its score is calibrated to $f_{\text{ERM}}$. In practice, we consider
normalizing $f_{\text{kNN}}$ by $k$, namely $\mathbf{D}_{i,i} = 1/k$ if $i \in \text{NN}(\mathbf{q}, \mathbf{Q}; k)$, so
that $f_{\text{kNN}}$ is still upper bounded by 1.
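The PEFA-XL score with the 1/k normalization can be sketched as follows. This is a brute-force illustration under stated assumptions: the nearest training queries are found by a linear scan instead of an ANN index on the query space, the function name is hypothetical, and the toy embeddings are unit-norm.

```python
import heapq

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pefa_xl_scores(q, P, Q, Y, lam, k):
    """Equation 6 with D_ii = 1/k, scoring every passage by brute force."""
    sims = [inner(q, q_i) for q_i in Q]
    # k training queries most similar to the test query (exact, not ANN).
    neighbors = heapq.nlargest(k, range(len(Q)), key=lambda i: sims[i])
    scores = []
    for j, p_j in enumerate(P):
        knn = sum(sims[i] * Y[i][j] for i in neighbors) / k
        scores.append(lam * inner(q, p_j) + (1.0 - lam) * knn)
    return scores

# Toy setup (assumed): the test query coincides with training query 0,
# which is labeled relevant to passage 0 even though the ERM prefers passage 1.
q = [1.0, 0.0]
P = [[0.0, 1.0], [1.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
Y = [[1, 0], [0, 1], [0, 1]]
scores = pefa_xl_scores(q, P, Q, Y, lam=0.3, k=1)
```

In this toy case the kNN term lifts passage 0 above passage 1, which shows how supervision stored next to training queries can override the ERM ranking.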


Complexity Analysis. At the inference stage, retrieving the match
set via PEFA-XL (Equation 6) requires ANN searches on two
distinct output spaces. Specifically, the ERM requires ANN search
on the passage space P of size n, while the kNN model requires
ANN search on the query space Q of size m. According to the
comprehensive review [55], the inference time complexity of HNSW


WSDM ’24, March 4-8, 2024, Merida, Mexico. 


Chang et al. 


Methods   Scoring function   HNSW index building time   HNSW index size                        HNSW inference time
ERM       Equation 1         O(n log(n))                O(nd + |E_P|)                          O(log(n))
PEFA-XL   Equation 6         O(n log(n) + m log(m))     O(nd + |E_P| + md + |E_Q| + nnz(Y))    O(log(n) + log(m))
PEFA-XS   Equation 8         O(n log(n))                O(nd + |E_P|)                          O(log(n))


Table 1: Comparing time and space complexity of ERM, PEFA-XS, and PEFA-XL at the inference stage. We use the competitive
graph-based ANN algorithm HNSW [39] as the underlying method for ANN search. The time and space complexity of HNSW is
taken from [55]. E_P and E_Q represent the edges of the HNSW graphs for P and Q, respectively.


on a data set S is O(log(|S|)). Thus, the inference time complexity
of PEFA-XL becomes O(log(n) + log(m)).

Next, we discuss the space complexity of PEFA-XL, which requires
storing two HNSW indices as well as the query-to-passage
relevance matrix Y. The space complexity of an HNSW index on a
data set S is O(|S|d + |E_S|), where the former comes from saving
the database vectors and the latter comes from saving the edges of
the HNSW graph. The space complexity of storing Y is O(nnz(Y)).
Thus, the space complexity of PEFA-XL is O(nd + |E_P| + md + |E_Q| +
nnz(Y)).

Finally, the time complexity of building HNSW indices for PEFA-XL
is O(n log(n) + m log(m)) because building an HNSW index for a
set S takes O(|S| log(|S|)) [55]. The building time, inference time,
and space complexity of PEFA-XL are summarized in Table 1.


Connections to kNN-LM. Using a non-parametric kNN model
to improve a parametric neural network has also been studied
in the context of k nearest neighbors language modeling (kNN-LM)
[17, 30, 59] and retrieval-augmented LM pre-training [3, 16, 32].
kNN-LM interpolates the next-token predictive probabilities of the
neural language model and the kNN model. While sharing similar
intuitions, PEFA-XL is different from kNN-LM because PEFA-XL
requires two ANN searches, one on the passage space P and the
other on the query space Q. In contrast, kNN-LM only needs one
ANN search on the context space, while inference on the vocabulary
space is exact since the size of the vocabulary is small.


3.2 PEFA-XS 


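In practice, PEFA-XL can be too expensive to deploy because it stores
two ANN indices, which doubles the model storage and inference
latency. Thus, we seek an efficient alternative to PEFA-XL that only
needs to maintain a single ANN index, hence the name PEFA-XS.

Recall that PEFA-XL demands another ANN search in order
to find the $k$ nearest queries in $\mathbf{Q}$, namely $\text{NN}(\mathbf{q}, \mathbf{Q}; k)$. We can
approximate $\text{NN}(\mathbf{q}, \mathbf{Q}; k)$ by the set of relevant queries given a
target passage $p_j \in \mathcal{P}$. We denote this alternative query set as
$\mathcal{I}(\mathbf{p}_j, \mathbf{Y}) = \{ i \mid \mathbf{Y}_{i,j} > 0, \; i = 1, \ldots, m \}$, which is a function of $\mathbf{p}_j$
that is independent of the test query $\mathbf{q}$. In other words, the resulting
diagonal matrix has $\mathbf{D}_{i,i} = 1$ if $i \in \mathcal{I}(\mathbf{p}_j, \mathbf{Y})$; otherwise $\mathbf{D}_{i,i} = 0$. The
approximate kNN model of PEFA-XS becomes


$$f_{\text{kNN}}(\mathbf{q}, \mathbf{p}_j) = \Big\langle \mathbf{q}, \sum_{i \in \mathcal{I}(\mathbf{p}_j, \mathbf{Y})} \mathbf{Y}_{i,j} \cdot \mathbf{q}_i \Big\rangle = \langle \mathbf{q}, \mathbf{Q}^{\top} \mathbf{Y}_{:,j} \rangle. \tag{7}$$


By plugging the query-independent kNN model of Equation 7 back
into Equation 2, we present the scoring function of PEFA-XS:


$$f_{\text{PEFA-XS}}(\mathbf{q}, \mathbf{p}_j) = \lambda \langle \mathbf{q}, \mathbf{p}_j \rangle + (1 - \lambda) \langle \mathbf{q}, \mathbf{Q}^{\top} \mathbf{Y}_{:,j} \rangle.$$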


Implementation. An illustration of the PEFA-XS method is presented
in Figure 2. Similar to the implementation design of PEFA-XL,
we need to normalize the scoring function $f_{\text{kNN}}$ in Equation 7
such that its score is upper bounded by 1. Thus, we introduce an
$\ell_2$ normalization operator $\Pi(\mathbf{x}) = \mathbf{x} / \|\mathbf{x}\|$ that projects an embedding
back to the unit sphere. We can then rewrite the scoring function
of PEFA-XS as


$$f_{\text{PEFA-XS}}(\mathbf{q}, \mathbf{p}_j) = \langle \mathbf{q}, \lambda \cdot \mathbf{p}_j + (1 - \lambda) \cdot \Pi(\mathbf{Q}^{\top} \mathbf{Y}_{:,j}) \rangle, \tag{8}$$


where the normalization step $\Pi(\cdot)$ can be absorbed into the design of $\mathbf{D}$.
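Because the kNN term of PEFA-XS does not depend on the test query, the interpolated passage embeddings of Equation 8 can be computed once, offline. The sketch below illustrates this under stated assumptions: function names are hypothetical, the data are toy lists, and a passage with no relevant training query keeps a zero aggregate (a choice made here for the example, not one stated in the text).

```python
import math

def l2_normalize(x):
    """Project x onto the unit sphere; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x] if norm > 0 else list(x)

def pefa_xs_embeddings(P, Q, Y, lam):
    """Equation 8: p_j' = lam * p_j + (1 - lam) * Pi(Q^T Y_{:,j}) for every j."""
    dim = len(P[0])
    interpolated = []
    for j, p_j in enumerate(P):
        # Sum the embeddings of all training queries relevant to passage j.
        agg = [sum(Y[i][j] * Q[i][t] for i in range(len(Q))) for t in range(dim)]
        agg = l2_normalize(agg)
        interpolated.append([lam * a + (1.0 - lam) * b for a, b in zip(p_j, agg)])
    return interpolated

# Toy setup (assumed): two passages, three training queries, 2-d embeddings.
P = [[0.0, 1.0], [1.0, 0.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
Y = [[1, 0], [0, 1], [0, 1]]
P_xs = pefa_xs_embeddings(P, Q, Y, lam=0.5)

# At query time a single inner-product search over P_xs is enough.
q = [1.0, 0.0]
xs_scores = [sum(a * b for a, b in zip(q, p)) for p in P_xs]
```

The vectors in `P_xs` are what would be inserted into the single ANN index, so the online path is identical to that of the plain ERM.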


Complexity Analysis. Note that the kNN model of PEFA-XS is
independent of the test query q, so the interpolation of the two scoring
functions can be pre-computed in the embedding space, as derived
in Equation 8. This suggests that PEFA-XS only requires a single
ANN index, built on a set of n interpolated passage embeddings.
Therefore, the inference of PEFA-XS shares the same time
and space complexity as the ERM alone. Specifically, the time complexity
of constructing the HNSW index and performing ANN search is
O(n log(n)) and O(log(n)), respectively. The space complexity of
storing the ANN index is O(nd + |E_P|). Finally, the time and space
complexity of PEFA-XS are summarized in Table 1.


Connections to XMC. Given a passage, aggregating the embeddings of its
relevant queries (as defined by customer behavior signals) into
an alternative embedding of the passage has also been explored in
the extreme multi-label classification (XMC) literature. Specifically,
the XMC community terms such a representation Positive Instance
Feature Aggregation, namely PIFA embeddings [5, 22, 61, 64]. However,
PIFA embeddings are often an aggregation of sparse TF-IDF
features, and are mostly used for unsupervised clustering to partition
the label space in the XMC literature [61]. In contrast, PEFA-XS
interpolates such alternative passage embeddings with the original
passage embeddings, and conducts ANN search on the interpolated
passage embedding space. As a side note, PIFA embeddings are also
closely connected to the simple graph convolution layer in graph
neural networks [8, 36, 57], where the input-to-label relevance matrix
is viewed as a bipartite graph.


4 EXPERIMENTS ON DOCUMENT RETRIEVAL 


In this section, we empirically verify the effectiveness of PEFA on
the document retrieval task. Experiment code is available at
https://github.com/amzn/pecos/tree/mainline/examples/pefa-wsdm24


4.1 Datasets & Evaluation Protocols 


Datasets. We conducted experiments on two public benchmarks 
for document retrieval, namely the Natural Questions [31] dataset 
and the Trivia-QA [27] dataset. 
• Natural Questions [31]: an open-domain question answering
dataset which consists of 320k query-document pairs,
where the documents containing answers are gathered from
Wikipedia and the queries are natural questions. The version
we use is often referred to as NQ-320K [2, 53, 56].

• Trivia-QA [27]: a reading comprehension dataset which
includes 78k query-document pairs from the Wikipedia
domain. We use the same version as in [56].


The state-of-the-art (SoTA) method, namely NCI [56], considers
generated query-document pairs as additional training signals in the
NQ-320K and Trivia-QA datasets. Following the same setup as NCI [56],
we also include those augmented query-document pairs when learning
PEFA on the NQ-320K and Trivia-QA datasets.


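Evaluation Protocols. We measure the performance with recall
metrics, which are widely used in retrieval communities [6, 29, 41,
53, 56]. Specifically, given a predicted score vector $\hat{\mathbf{y}} \in \mathbb{R}^{n}$ and
a ground truth label vector $\mathbf{y} \in \{0, 1\}^{n}$, Recall@$k$ is defined as
$\text{Recall@}k = \frac{1}{|\mathbf{y}|} \sum_{j \in \text{top}_k(\hat{\mathbf{y}})} \mathbf{y}_j$, where $|\mathbf{y}|$ is the number of relevant labels in $\mathbf{y}$
and $\text{top}_k(\hat{\mathbf{y}})$ denotes the labels with
the $k$ largest predicted scores.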
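The metric can be written in a few lines. This is a minimal sketch with a hypothetical function name; returning 0.0 for a query without any relevant label is an assumption made here to avoid division by zero.

```python
import heapq

def recall_at_k(pred_scores, labels, k):
    """Fraction of relevant labels that appear among the k highest-scored items."""
    num_relevant = sum(labels)
    if num_relevant == 0:
        return 0.0  # assumption: queries without relevant labels score zero
    top_k = heapq.nlargest(k, range(len(pred_scores)), key=lambda j: pred_scores[j])
    return sum(labels[j] for j in top_k) / num_relevant

# Toy example (assumed): two relevant passages, one of them ranked in the top 2.
pred = [0.9, 0.1, 0.8, 0.3]
truth = [1, 1, 0, 0]
r_at_2 = recall_at_k(pred, truth, k=2)
```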


4.2 Implementation Details 


Comparison Methods. Our PEFA framework is applicable to
any black-box ERM. We applied PEFA to various competitive
ERMs, such as Sent-BERTdistill [45], DPRbase [29], MPNetbase [50],
Sentence-T5base [41] and GTRbase [42]. We also compare PEFA with
recent SoTA Sequence-to-Sequence (Seq2Seq) models, including
Differentiable Search Index (DSI) [53], Search Engines with
Autoregressive LMs (SEAL) [2] and Neural Corpus Indexer (NCI) [56].


Hyper-parameters. PEFA has two hyper-parameters: the interpolation
coefficient $\lambda$ in Eq. 2 and the number of nearest neighbors
$k$ in Eq. 6, which is used by PEFA-XL only. We present ablation studies
on hyper-parameters in Section 4.4, where $\lambda \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$
and $k \in \{16, 32, 64\}$. For $\lambda = 1.0$, PEFA reduces to its baseline
ERM. We consider HNSW for ANN search and set hyper-parameters
according to existing work [5, 38, 43]. At the index building stage,
the maximum number of edges per node is M = 32 and the size of the priority queue
for graph construction is efC = 500. At the online serving stage, the
beam search width for graph search is efS = 300.


4.3 Main Results 


NQ-320K. In Table 2, we applied PEFA-XS and PEFA-XL to
fine-tuned ERMs: Sent-BERTdistill [45], DPRbase [29], MPNetbase [50],
Sentence-T5base [41] and GTRbase [42]. These ERMs were
full-parameter fine-tuned on the NQ-320K dataset. The proposed PEFA
framework achieved significant improvement for a wide range
of black-box ERMs. The average gains of PEFA-XS over ERMs are
+9.22% and +5.28% for Recall@10 and Recall@100, respectively.
The average gains of PEFA-XL over ERMs are +11.82% and +5.72%
for Recall@10 and Recall@100, respectively. For competitive ERMs
such as MPNetbase [50] and GTRbase [42], PEFA further outperforms
the previous SoTA Seq2Seq method, namely NCI [56].
The Recall@10 and Recall@100 of MPNetbase +PEFA-XL are 88.72%
and 95.13%, which is considerably better than the previous SoTA
NCI. On the other hand, the original MPNetbase without PEFA
cannot outperform the previous SoTA method, NCI.
Methods                           Recall@10   Recall@100
BM-25                             32.48       50.54
DSI (base) [53]                   56.60       -
NCI (base) [56]                   85.20       92.42
SEAL (large) [2]                  81.24       90.93
Sent-BERTdistill [45]             67.08       81.40
  +PEFA-XS (ours)                 80.52       92.22
  +PEFA-XL (ours)                 85.26       92.53
DPRbase [29]                      70.68       85.19
  +PEFA-XS (ours)                 83.45       92.22
  +PEFA-XL (ours)                 84.65       92.07
MPNetbase [50]                    80.82       92.39
  +PEFA-XS (ours)                 86.67       94.53
  +PEFA-XL (ours)                 88.72       95.13
Sentence-T5base [41]              73.63       88.16
  +PEFA-XS (ours)                 82.52       92.18
  +PEFA-XL (ours)                 83.69       92.55
GTRbase [42]                      79.74       90.91
  +PEFA-XS (ours)                 84.90       93.28
  +PEFA-XL (ours)                 88.71       94.36
Avg. Gain of PEFA-XS over ERM     +9.22       +5.28
Avg. Gain of PEFA-XL over ERM     +11.82      +5.72


Table 2: Our PEFA framework on NQ-320K dataset. Both 
Seq2Seq models and ERMs are full-parameter fine-tuned. The 
results of BM-25, DSI [53], NCI [56] and SEAL [2] are taken 
from [56]. 1st/2nd place numbers are boldface/underscore, 
respectively. 
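The two average-gain rows of Table 2 can be reproduced from the per-model rows. The sketch below copies the table values into a dictionary (the key names and helper function are assumptions for illustration) and averages the per-model differences between each PEFA variant and its base ERM.

```python
# (Recall@10, Recall@100) per model, copied from Table 2.
table2 = {
    "Sent-BERTdistill": {"ERM": (67.08, 81.40), "XS": (80.52, 92.22), "XL": (85.26, 92.53)},
    "DPRbase":          {"ERM": (70.68, 85.19), "XS": (83.45, 92.22), "XL": (84.65, 92.07)},
    "MPNetbase":        {"ERM": (80.82, 92.39), "XS": (86.67, 94.53), "XL": (88.72, 95.13)},
    "Sentence-T5base":  {"ERM": (73.63, 88.16), "XS": (82.52, 92.18), "XL": (83.69, 92.55)},
    "GTRbase":          {"ERM": (79.74, 90.91), "XS": (84.90, 93.28), "XL": (88.71, 94.36)},
}

def average_gain(table, variant, metric_idx):
    """Mean absolute gain of a PEFA variant over its base ERM for one metric."""
    gains = [row[variant][metric_idx] - row["ERM"][metric_idx] for row in table.values()]
    return sum(gains) / len(gains)

xs_gain_r10 = round(average_gain(table2, "XS", 0), 2)
xs_gain_r100 = round(average_gain(table2, "XS", 1), 2)
xl_gain_r10 = round(average_gain(table2, "XL", 0), 2)
xl_gain_r100 = round(average_gain(table2, "XL", 1), 2)
```

With the values above, the four results match the last two rows of the table (+9.22 / +5.28 for PEFA-XS and +11.82 / +5.72 for PEFA-XL).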


Trivia-QA. In Table 3, we applied PEFA-XS and PEFA-XL to 
pre-trained ERMs. Note that these ERMs were not fine-tuned with 
any relevant query-document pairs from Trivia-QA. The setup 
examines the robustness and generalization of our PEFA framework. 
We observe that PEFA-XS and PEFA-XL achieve a larger average gain in
Recall over the unsupervised ERMs when comparing Table 3 to
Table 2. When the underlying ERMs are pre-trained only (not
fine-tuned on the downstream task), PEFA-XS seems to perform slightly
better than PEFA-XL in Recall@20, where the former has an average
gain of 18.67% while the latter has an average gain of 17.06%.


4.4 Ablation Studies 


In Table 4, we present ablation studies of two hyper-parameters of
our PEFA framework on the NQ-320K dataset. $\lambda$ is the interpolation
coefficient that balances $f_{\text{ERM}}$ and $f_{\text{kNN}}$ in Equation 2. When
$0.0 < \lambda < 1.0$, the Recall@100 of both PEFA-XS and PEFA-XL is
consistently higher than that of the ERM alone ($\lambda = 1.0$). For PEFA-XS and
PEFA-XL, $\lambda = 0.5$ and $\lambda = 0.1$ mostly yield the largest gain on
average, respectively. Crucially, the linear interpolation of PEFA-XS
can be pre-computed offline at the HNSW index building stage (see
Figure 2) and hence does not add any inference latency overhead
compared to the ERMs. For PEFA-XL, besides the hyper-parameter
$\lambda$, there is another hyper-parameter $k$, controlling the number of
nearest neighbors in the kNN model $f_{\text{kNN}}$. We observed that $k = 32$
generally saturates the performance.
Methods                           Recall@20   Recall@100
BM-25                             69.45       80.24
NCI (base) [56]                   94.45       96.94
SEAL (large) [2]                  81.24       90.93
Sent-BERTdistill [45]             51.94       68.50
  +PEFA-XS (ours)                 86.28       93.33
  +PEFA-XL (ours)                 83.76       91.83
DPRbase [29]                      60.69       73.80
  +PEFA-XS (ours)                 82.97       91.06
  +PEFA-XL (ours)                 78.76       89.62
MPNetbase [50]                    77.03       87.34
  +PEFA-XS (ours)                 86.05       92.97
  +PEFA-XL (ours)                 86.13       92.42
Sentence-T5base [41]              62.74       77.21
  +PEFA-XS (ours)                 78.39       88.57
  +PEFA-XL (ours)                 75.13       87.24
GTRbase [42]                      71.75       82.05
  +PEFA-XS (ours)                 83.81       91.02
  +PEFA-XL (ours)                 85.27       92.36
Avg. Gain of PEFA-XS over ERM     +18.67      +13.61
Avg. Gain of PEFA-XL over ERM     +17.06      +12.80


Table 3: Our PEFA framework on Trivia-QA dataset. Seq2Seq 
models are full-parameter fine-tuned while ERMs are un- 
supervisedly pre-trained. ERMs +PEFA did not fine-tune or 
update any parameter of the underlying ERMs. 1st/2nd place 
numbers are boldface/underscore, respectively. 


                                      Recall@100 of various λ
ERM               PEFA                0.1     0.3     0.5     0.7     0.9
DPRbase           PEFA-XS             91.48   92.22   91.71   89.87   87.08
                  PEFA-XL (k=16)      91.98   90.66   89.72   88.54   87.62
                  PEFA-XL (k=32)      92.07   90.50   89.20   88.62   87.46
                  PEFA-XL (k=64)      91.93   89.89   88.95   88.39   87.04
Sentence-T5base   PEFA-XS             91.23   92.16   92.20   91.61   89.72
                  PEFA-XL (k=16)      92.53   91.25   90.82   90.69   90.24
                  PEFA-XL (k=32)      92.34   91.20   90.96   90.77   90.11
                  PEFA-XL (k=64)      92.22   91.26   91.03   90.70   89.90
GTRbase           PEFA-XS             92.11   93.07   93.31   92.85   91.74
                  PEFA-XL (k=16)      94.36   93.32   92.81   92.53   91.93
                  PEFA-XL (k=32)      94.32   93.23   92.82   92.44   91.79
                  PEFA-XL (k=64)      93.93   93.14   92.76   92.29   91.62

Table 4: Ablation study of our PEFA framework on NQ-320K
dataset. PEFA-XS has only one hyper-parameter, namely
the interpolation coefficient λ. PEFA-XL has two
hyper-parameters: λ and k (number of nearest neighbors). For
λ = 1.0, PEFA-XS and PEFA-XL reduce to the same
underlying ERM in Table 2.


5 EXPERIMENTS ON PRODUCT SEARCH 


For large-scale product search systems, full-parameter fine-tuning
may take thousands of GPU hours. In this section, we conducted 
experiments on such larger-scale datasets and demonstrated that 
our PEFA framework is an effective and fast technique that offers 
sizable improvements to not only a variety of pre-trained ERMs but 
also the full-parameter fine-tuned ERMs. 
5.1 Datasets & Evaluation Protocols


Datasets. We follow a procedure similar to [5, 26, 36, 43] to collect
datasets from a large e-commerce product search engine. Based on
the size of the catalog n = |P|, we construct three subsets as follows.


• ProdSearch-5M: consists of roughly 30 million relevant
query-product pairs, which cover around 10 million
queries and 5 million products.

• ProdSearch-15M: consists of roughly 150 million relevant
query-product pairs, which cover around 40 million
queries and 15 million products.

• ProdSearch-30M: consists of roughly 500 million relevant
query-product pairs, which cover around 100 million
queries and 30 million products.


For all proprietary ProdSearch datasets, the data statistics do not 
reflect the real traffic of the e-commerce system due to privacy 
concerns. All relevant query-product pairs are random samples
from anonymous aggregated search logs. We further split those
pairs into the training set and the test set by time horizon, where
we use the first twelve months of search logs as the training set and
the last month of search logs as the evaluation test set.


Evaluation Protocol. To eliminate evaluation bias toward our 
PEFA framework, all test queries are unseen in the training set. To 
avoid disclosing the exact performance of production systems, we 
report absolute gain of Recall@k metrics between the proposed 
PEFA framework and the baseline ERMs. 
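The time-based split and the restriction to unseen test queries can be sketched as below. This is an illustrative sketch only: the log format, the cutoff date, the function name, and the choice to enforce "unseen" by dropping test pairs whose query occurs in training are all assumptions, since the text does not describe how the split is implemented.

```python
from datetime import date

def split_by_time(pairs, cutoff):
    """pairs: iterable of (query, product, day). Train is before cutoff, test is on or after."""
    train = [(q, p) for q, p, day in pairs if day < cutoff]
    seen_queries = {q for q, _ in train}
    # Keep only test pairs whose query never occurs in the training period.
    test = [(q, p) for q, p, day in pairs if day >= cutoff and q not in seen_queries]
    return train, test

# Toy search log (hypothetical queries, product ids, and dates).
logs = [
    ("red shoes", "prod_1", date(2022, 3, 5)),
    ("usb cable", "prod_2", date(2022, 11, 20)),
    ("red shoes", "prod_3", date(2023, 1, 10)),
    ("desk lamp", "prod_4", date(2023, 1, 15)),
]
train_pairs, test_pairs = split_by_time(logs, cutoff=date(2023, 1, 1))
```

In this toy log the later "red shoes" pair is dropped from the test set because the same query already appears in the training period.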

We also report the ANN index size (GiB) and the index building
time (hours) in the offline indexing stage. For online inference,
following the ANN benchmark protocol [1], we consider the single-thread
setup and report the inference latency (milliseconds/query).


5.2 Main Results 


In Table 5, we applied PEFA to pre-trained ERMs (e.g., MPNetbase [50],
Sentence-T5base [41], GTRbase [42] and E5base [54]) and the
fine-tuned ERMs (FT-ERM [40]). For privacy of the proprietary product
search datasets, we only report the absolute gain of Recall metrics
compared to the MPNetbase baseline.

Without PEFA, pre-trained ERMs have much lower Recall metrics
compared to FT-ERM, as the latter is carefully pre-trained and
fine-tuned. Adding PEFA-XS and PEFA-XL to those pre-trained ERMs
significantly lifts their Recall to a level comparable to, or even better than,
that of the fine-tuned FT-ERM. Take the largest dataset, ProdSearch-30M, as an
example. Adding PEFA-XL to Sentence-T5base, GTRbase and E5base
yields Recall@100 gains of 30.10%, 31.71% and 31.91%, respectively.
These Recall@100 gains already exceed that of the
fine-tuned FT-ERM. On the other hand, PEFA-XS on pre-trained
ERMs offers a smaller Recall gain compared to PEFA-XL. Only E5base
+PEFA-XS has a larger Recall@100 gain than the
fine-tuned FT-ERM.

Similar to the finding on NQ-320K, we also see that PEFA can
further improve the performance of fine-tuned ERMs. For example,
on the largest dataset ProdSearch-30M, PEFA-XS and PEFA-XL
further improve the Recall@100 of the fine-tuned FT-ERM by 5.30%
and 14.50%, respectively.


PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models 


                         ProdSearch-5M              ProdSearch-15M             ProdSearch-30M
Methods                  Recall@100   Recall@1000   Recall@100   Recall@1000   Recall@100   Recall@1000
MPNetbase [50]           0.00         0.00          0.00         0.00          0.00         0.00
  +PEFA-XS (ours)        11.23        13.14         5.05         11.79         9.67         17.47
  +PEFA-XL (ours)        22.83        12.31         23.48        21.56         27.22        18.96
Sentence-T5base [41]     0.44         3.42          1.32         3.44          1.89         5.17
  +PEFA-XS (ours)        13.63        17.13         10.39        16.34         13.18        21.28
  +PEFA-XL (ours)        23.09        13.43         23.91        23.72         30.10        21.25
GTRbase [42]             7.85         9.23          6.75         10.33         8.35         9.83
  +PEFA-XS (ours)        17.32        19.55         16.83        25.00         18.49        24.38
  +PEFA-XL (ours)        27.79        19.23         27.87        28.75         31.71        24.28
E5base [54]              9.93         9.75          9.98         12.98         12.01        12.61
  +PEFA-XS (ours)        19.23        19.18         17.21        27.78         20.11        26.08
  +PEFA-XL (ours)        26.83        17.75         30.48        31.07         31.91        25.49
FT-ERM [40]              21.32        20.87         21.74        30.04         18.49        24.11
  +PEFA-XS (ours)        23.42        22.17         26.34        34.84         23.79        29.61
  +PEFA-XL (ours)        29.32        22.87         36.54        37.24         32.99        30.01
Avg. Gain of PEFA-XS     16.97        18.23         15.16        23.15         17.05        23.76
Avg. Gain of PEFA-XL     25.97        17.12         28.46        28.47         30.79        24.00


Table 5: Applying PEFA to pre-trained ERMs (MPNetbase [50], Sentence-T5base [41], GTRbase [42] and E5base [54]) and the
fine-tuned ERM (FT-ERM [40]) on three proprietary product search datasets: ProdSearch-5M, ProdSearch-15M and ProdSearch-30M.
To avoid disclosing the exact performance of production systems for privacy concerns, all reported numbers are absolute gains
of Recall metrics compared to the baseline method MPNetbase. 1st/2nd place numbers are boldface/underscore, respectively.
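
As a reading aid for Table 5, the short sketch below recomputes two kinds
of numbers from the ProdSearch-30M Recall@100 column: the "Avg. Gain" rows,
which appear consistent with a plain mean of the five +PEFA rows, and the
improvement over FT-ERM quoted in the text. The values are copied from the
table; the variable names are ours and purely illustrative.

```python
# Recall@100 gains on ProdSearch-30M, copied from Table 5.
base = {"MPNetbase": 0.00, "Sentence-T5base": 1.89, "GTRbase": 8.35,
        "E5base": 12.01, "FT-ERM": 18.49}
xs = {"MPNetbase": 9.67, "Sentence-T5base": 13.18, "GTRbase": 18.49,
      "E5base": 20.11, "FT-ERM": 23.79}
xl = {"MPNetbase": 27.22, "Sentence-T5base": 30.10, "GTRbase": 31.71,
      "E5base": 31.91, "FT-ERM": 32.99}

# "Avg. Gain" rows: mean of the +PEFA rows over the five ERMs.
avg_xs = round(sum(xs.values()) / len(xs), 2)
avg_xl = round(sum(xl.values()) / len(xl), 2)

# Improvement of PEFA over the fine-tuned FT-ERM itself.
ft_gain_xs = round(xs["FT-ERM"] - base["FT-ERM"], 2)
ft_gain_xl = round(xl["FT-ERM"] - base["FT-ERM"], 2)

print(avg_xs, avg_xl, ft_gain_xs, ft_gain_xl)
```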


5.3 Indexing and Inference 


In Table 6, we discuss the trade-off between the performance and
the deployment efficiency of the proposed PEFA framework. Note
that PEFA is a parameter-free method that does not update the model
parameters of ERMs and can be easily implemented in the offline
HNSW index building stage. For the largest dataset, ProdSearch-30M,
the run-times of building HNSW indices for PEFA-XS and
PEFA-XL are 1.0 and 4.7 hours, respectively. This is much faster
than the hundreds of GPU hours needed to fine-tune FT-ERM on the
billion-scale dataset.

Despite its larger gain in recall metrics, PEFA-XL comes at the
cost of a larger HNSW index, longer index building time, and higher
inference latency. Specifically, the HNSW index size of PEFA-XL
is 3.6x larger than the HNSW index of the ERM, as PEFA-XL requires
two HNSW indices: one on the product embeddings
$P \in \mathbb{R}^{n \times d}$ and the other on the training query
embeddings $Q \in \mathbb{R}^{m \times d}$ for the kNN modeling. For
product search datasets, the number of queries m can be larger than
the number of products n. For similar reasons, the inference latency
of PEFA-XL is 2.4x larger than the latency of the ERM.
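
To make the two-index cost concrete, here is a minimal NumPy sketch of a
PEFA-XL-style scoring step. It is not the authors' implementation: exact
dot-product search stands in for the two HNSW lookups, the
similarity-weighted vote used for the kNN term is an assumption made for
illustration (the exact definition is given in Section 3), and the names
`pefa_xl_scores`, `lam` and `k` are hypothetical.

```python
import numpy as np

def pefa_xl_scores(q, P, Q, Y, lam=0.5, k=2):
    """Score all n products for one test query (illustrative only).

    q: (d,) test-query embedding
    P: (n, d) product embeddings          (first index)
    Q: (m, d) training-query embeddings   (second index)
    Y: (m, n) binary relevance matrix
    lam: hypothetical interpolation weight in [0, 1]
    k: hypothetical number of training-query neighbours
    """
    # Lookup 1: ERM score against every product (exact search, not HNSW).
    f_erm = P @ q
    # Lookup 2: similarity to every training query, keep the k largest.
    sims = Q @ q
    top = np.argpartition(-sims, k - 1)[:k]
    # Assumed kNN score: each neighbour votes for its relevant products,
    # weighted by its similarity to the test query.
    f_knn = sims[top] @ Y[top]
    # Convex combination of the two scoring functions.
    return lam * f_erm + (1.0 - lam) * f_knn

P = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
Y = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
q = np.array([1.0, 0.0])
scores = pefa_xl_scores(q, P, Q, Y, lam=0.5, k=1)
```

In this toy example the second product receives no ERM score, yet it is
still surfaced because the nearest training query was labeled relevant to
it. The second lookup over Q is what adds index size and latency.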

On the other hand, PEFA-XS not only achieves modest gains in
recall metrics, but also maintains the same deployment efficiency
(e.g., HNSW index size and inference latency) as its baseline ERM.
Recall that PEFA-XS maintains only one ANN index because the
interpolation of $f_{\text{ERM}}$ and $f_{\text{kNN}}$ is independent
of the test-time query and can therefore be pre-computed offline into
a single ANN index (see Equation 8 in Section 3.2). From the
deployment perspective, PEFA-XS may be a more practical choice as it
introduces zero additional overhead to the production system at
inference time.
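
The single-index property can be illustrated with a small NumPy sketch.
We assume, for illustration only, that the kNN score of a product is the
inner product between the test query and the average embedding of the
training queries labeled relevant to that product; the exact form is
Equation 8 in Section 3.2. Under that assumption the interpolation folds
into the product embeddings by linearity. The function names and `lam`
are hypothetical.

```python
import numpy as np

def _column_normalize(Y):
    # Each product averages over its relevant training queries;
    # products with no relevant query keep an all-zero column.
    return Y / np.maximum(Y.sum(axis=0, keepdims=True), 1.0)

def pefa_xs_embeddings(P, Q, Y, lam=0.5):
    """Pre-compute interpolated product embeddings offline (assumed form).

    P: (n, d) product embeddings, Q: (m, d) training-query embeddings,
    Y: (m, n) binary relevance matrix, lam: weight in [0, 1].
    """
    Q_agg = _column_normalize(Y).T @ Q          # (n, d)
    return lam * P + (1.0 - lam) * Q_agg        # same shape as P

def interpolated_scores(q, P, Q, Y, lam=0.5):
    """Reference: interpolate the two scores at query time instead."""
    f_erm = P @ q                               # (n,)
    f_knn = _column_normalize(Y).T @ (Q @ q)    # (n,)
    return lam * f_erm + (1.0 - lam) * f_knn
```

Because `pefa_xs_embeddings` returns a matrix with the same shape as P,
it can be indexed in place of P, which is why the index size and the
query-time cost match those of the baseline ERM.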


                              Indexing                      Serving
Datasets         Methods      disk-size (GiB)  run-time (hours) | Latency (ms/query)
ProdSearch-5M    FT-ERM                  13.1               0.3 |               0.82
                 +PEFA-XS                13.1               0.2 |               0.67
                 +PEFA-XL                32.2               0.7 |               2.15
ProdSearch-15M   FT-ERM                  28.6               0.6 |               0.91
                 +PEFA-XS                28.6               0.5 |               0.94
                 +PEFA-XL               100.7               1.9 |               1.94
ProdSearch-30M   FT-ERM                  51.9               0.9 |               0.77
                 +PEFA-XS                51.9               1.0 |               0.71
                 +PEFA-XL               287.7               4.7 |               1.99

Table 6: For practical deployment consideration, we report 
the HNSW index size on disk (GiB) and the run-time (hours) 
of PEFA during offline index building stage. We also report 
the inference latency (millisecond/query) of the HNSW index 
for online serving. 


5.4 Effect of Supervised Data Size 


The amount of supervised data (i.e., relevant query-product pairs)
consumed by PEFA plays a crucial role in the predictive power of
PEFA-XS and PEFA-XL, the run-time of HNSW index building, and
the model size of the resulting HNSW indices. Hence, we present such
an analysis in Figure 3. The amount of supervised data is controlled by
the sampling ratio {0.05, 0.10, 0.25, 0.50, 0.75, 0.95}. In particular, we
uniformly sample query-product pairs from the relevance matrix
$Y \in \{0,1\}^{m \times n}$ in Equation 3.
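
The subsampling step can be sketched as follows. This is a minimal
illustration, assuming a dense NumPy relevance matrix for readability (a
sparse matrix would be used at the scale discussed here); the function
name `subsample_pairs` and the `seed` argument are hypothetical.

```python
import numpy as np

def subsample_pairs(Y, ratio, seed=0):
    """Keep a uniform random fraction `ratio` of the positive pairs in Y."""
    rng = np.random.default_rng(seed)
    # Coordinates of all relevant (query, product) pairs.
    rows, cols = np.nonzero(Y)
    n_keep = int(round(ratio * rows.size))
    # Uniform sampling without replacement over the positive pairs.
    keep = rng.choice(rows.size, size=n_keep, replace=False)
    Y_sub = np.zeros_like(Y)
    Y_sub[rows[keep], cols[keep]] = 1
    return Y_sub

# Toy relevance matrix with 100 positive pairs.
Y_full = np.zeros((10, 20))
Y_full[:, :10] = 1
for ratio in (0.05, 0.10, 0.25, 0.50, 0.75, 0.95):
    print(ratio, int(subsample_pairs(Y_full, ratio).sum()))
```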

Figure 3: The amount of supervised data versus predictive
power and model size for the proposed PEFA framework. The 
y-axis of the 1st row figures is the recall gain compared to 
the FT-ERM (black line). The y-axis of the 2nd row figures 
is the PEFA model size (GiB). The x-axis of all figures is the 
ratio of supervised data being used. 


With 10% of the relevant query-product pairs sampled from Y,
PEFA-XS reaches the same level of Recall@100 as the fine-tuned
FT-ERM. With more supervised data, PEFA-XS eventually outperforms
FT-ERM. Moreover, the model size of PEFA-XS does not increase as
it consumes more supervised data.

Interestingly, PEFA-XL achieves significant improvements in
Recall@100 even with just 5% of the supervised data. At 5% of the
supervised data, the resulting HNSW index is around 1.4x larger
than the HNSW index of FT-ERM and PEFA-XS. Also, the inference
latency of PEFA-XL appears to be consistently about 2x larger than
the latency of FT-ERM and PEFA-XS across all datasets. Again, it
is up to practitioners to decide the trade-off between the additional
performance gain brought by PEFA-XL and the cost of a larger index
size and higher inference latency.


6 RELATED WORK 


6.1 Dense Text Retrieval 


DSSM [21] and C-DSSM [49] utilize multi-layer perceptron and con- 
volutional neural networks while DPR [29] deploys pre-trained neu- 
ral language models (NLMs) like BERT [10]. Some studies attempt to 
improve ERMs by pre-training and adjusting results. Condenser [12] 
pre-trains NLMs with the idea of Funnel-Transformer [9] while 
Co-Condenser [13] re-ranks its retrieval results with an attentive 
cross-encoder. DPTDR [52] applies prompt-tuning [35] to further
improve the quality of dual encoders for ERMs. However, conventional
ERMs can struggle with tail queries and labels, especially when the
index is of enormous industry scale [46]. Even though some lines of
research attempt to address this issue by computing label-centric
similarity [47] and multi-view representations [65], they are
infeasible for industrial production because they require extensive
pre-training and additional cross-attentive computations between
queries and labels.

Different from existing approaches, PEFA does not need to pre-train
another embedding model, so the enhancement can be achieved within
the short time needed to build the ANN search index. Moreover, the
kNN model over training queries can further benefit the representation
capability of the PEFA embeddings.


6.2 Inference with Training Instances 


Similar to our proposed PEFA framework, some studies also leverage
training instances at inference time for better performance in various
research fields. k-nearest-neighbor language models (kNN-LM)
[17, 30, 59] build a kNN model within the small vocabulary
space, while PIFA embeddings [5, 22, 61, 64] aggregate sparse
representations of training instances for the label space. The derived
coresets of training embeddings can be indexed with ANN search to
shrink candidate labels and accelerate inference for recommender
systems [24] and neural language models [7]. However, none of the
above methods addresses the challenge of large-scale retrieval at
industry scale.


6.3 Parameter-Efficient Tuning of ERMs 


To avoid the expensive full-parameter fine-tuning of ERMs for
various downstream tasks, there are some preliminary studies on
parameter-efficient fine-tuning of ERMs [28, 37, 44]. Nevertheless,
as pointed out by [37], naively applying existing parameter-efficient
fine-tuning methods from the NLP literature, such as Adapter [18],
prefix-tuning [33] and LoRA [19], often results in limited success for
ERMs in retrieval applications. Furthermore, parameter-efficient
fine-tuning approaches still require access to the models' gradients,
which may not be available for recent powerful large language
models (LLMs) such as GPT-3 [4]. Our proposed PEFA framework
is complementary to any pre-trained or fine-tuned ERM,
including ERMs derived from parameter-efficient fine-tuning.
Notice that PEFA does not require any gradient information from the
underlying ERMs, which can have a broader impact on black-box
ERMs where the encoders are initialized from LLMs.


7 CONCLUSIONS 


In this paper, we propose PEFA, parameter-free adapters for fast
tuning of black-box ERMs. PEFA offers flexible choices (i.e., PEFA-XS
and PEFA-XL) for practitioners to improve their pre-trained or
fine-tuned ERMs efficiently, without any updates to the model
parameters of ERMs. PEFA-XL brings a more significant gain in
Recall@k at the cost of doubling the ANN index size and inference
latency, while PEFA-XS yields a modest gain in Recall@k without any
overhead compared to the existing ERM inference pipeline. For
document retrieval, PEFA not only improves the Recall@100 of
pre-trained ERMs on Trivia-QA by an average of 13.2%, but also lifts
the Recall@100 of fine-tuned ERMs on NQ-320K by an average of 5.5%.
On the NQ-320K dataset, applying PEFA to MPNetbase [50] and
GTRbase [42] reaches new SoTA results, where the Recall@10 of 88.72%
outperforms the 85.20% of the previous SoTA, the Seq2Seq-based NCI [56].
For product search with billion-scale data, PEFA improves
the Recall@100 of the fine-tuned ERMs by an average of 5.3% and
14.5%, for PEFA-XS and PEFA-XL, respectively.


ETHICAL CONSIDERATIONS


We discuss ethical implications of our PEFA framework in two 
perspectives: interpretability and privacy. 


Interpretability. Given a query, embedding-based retrieval 
models (ERMs) retrieve the match set based on similarity search 
between the query embedding and the corpus of passage embeddings.
However, the interpretability and explainability of ERMs are
quite limited because we do not know which training examples
contribute to or lead to the decisions of the retrieved match set. Our
proposed framework PEFA combines ERMs with a non-parametric 
kNN component, which enhances the interpretability of ERMs. The 
kNN component computes similarity scores between the test query 
and the set of training queries, hence we know which training 
examples contribute the most to the retrieved match-set. 


Privacy. For e-commerce product search, it is crucial to protect
customers' privacy. Thus, we need to ensure that the underlying
models do not explicitly memorize customers' purchase history.
When applying PEFA to the product search datasets, we carefully
anonymized the search log, hence we never know which customer
issued a specific query. Furthermore, we use yearly-aggregated
data of query-product pairs as the training signals in our kNN
component. In other words, each query in our training set cannot be
traced back to its original query session.